Skip to content

feat(evaluators): add new lluna client#213

Open
namrataghadi-galileo wants to merge 24 commits into
mainfrom
feature/64546-add-new-luna-client
Open

feat(evaluators): add new lluna client#213
namrataghadi-galileo wants to merge 24 commits into
mainfrom
feature/64546-add-new-luna-client

Conversation

@namrataghadi-galileo
Copy link
Copy Markdown
Contributor

Summary

  • What changed and why.

Scope

  • User-facing/API changes:
  • Internal changes:
  • Out of scope:

Risk and Rollout

  • Risk level: low / medium / high
  • Rollback plan:

Testing

  • Added or updated automated tests
  • Ran make check (or explained why not)
  • Manually verified behavior

Checklist

  • Linked issue/spec (if applicable)
  • Updated docs/examples for user-facing changes
  • Included any required follow-up tasks

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

@namrataghadi-galileo namrataghadi-galileo changed the title feat(evaluator): add new lluna client feat(evaluators): add new lluna client May 7, 2026
Comment on lines +41 to +52
def luna_config() -> dict[str, Any]:
"""Build the direct Luna evaluator config used by the composite control."""
config: dict[str, Any] = {
"scorer_label": LUNA_SCORER_LABEL,
"threshold": LUNA_THRESHOLD,
"operator": "gte",
"payload_field": "output",
"on_error": "allow",
}
if GALILEO_PROJECT_ID:
config["project_id"] = GALILEO_PROJECT_ID
return config
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

payload_field, on_error, and project_id are set at the top level of the evaluator config, but LunaEvaluatorConfig declares none of them and the shared BaseModel uses extra="ignore", so all three are silently dropped. The example demonstrates configuration that has no effect — users following it will believe these settings are honored. Either implement these as first-class fields on LunaEvaluatorConfig (looks intended for payload_field and on_error), or remove them from the example and README.

Comment thread engine/src/agent_control_engine/core.py Outdated
"message": self._truncated_message(result.message),
}
metadata = dict(result.metadata or {})
metadata["selected_data"] = data
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unconditionally injects the raw selector output (prompts, tool args, model output, potentially PII or secrets) into evaluator metadata for every leaf control. The metadata flows out through EvaluationResponse.matches/errors/non_matches to API clients, and is also fanned out into OTEL span attributes by the SDK's otel_sink (every metadata key becomes agent_control.metadata.<key>). The server's _sanitize_control_match only redacts internal evaluator error strings — it doesn't touch metadata. No size cap, redaction, or opt-out. Gate selected_data behind a config flag with truncation, or extend the sanitizer to scrub it before exporting.

Comment on lines +7 to +10
dependencies = [
"agent-control-sdk",
"agent-control-evaluator-galileo",
]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dependencies declares only agent-control-sdk and agent-control-evaluator-galileo, but [tool.uv.sources] lists six workspace packages. [tool.uv.sources] only overrides sources for already-declared dependencies — it does not add packages. The SDK imports agent_control_telemetry (and other workspace packages) at runtime, but those are not vendored in editable mode, so demo_agent.py's first import will fail when a user runs uv run python setup_controls.py as documented. See examples/google_adk_plugin/pyproject.toml for the correct pattern (all workspace packages listed under both dependencies and [tool.uv.sources]).

Comment thread examples/galileo_luna/README.md Outdated
Comment on lines +27 to +31
If the scorer requires explicit project resolution, set:

```bash
export GALILEO_PROJECT_ID="00000000-0000-0000-0000-000000000000"
```
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README tells users to set GALILEO_PROJECT_ID for project-scoped scorer resolution, but LunaEvaluatorConfig has no project_id field and setup_controls.py places it at the top level where it gets dropped by extra="ignore". The instruction is misleading. Either implement the field end-to-end through the config and scorer payload, or remove this section.

Comment on lines +264 to +270
metadata={
"error": error_detail,
"error_type": type(error).__name__,
"scorer_label": self.config.scorer_label,
"scorer_id": self.config.scorer_id,
"scorer_version_id": self.config.scorer_version_id,
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing error_detail in metadata["error"] lets raw upstream error text bypass the server sanitizer. _sanitize_control_match redacts result.error and metadata["condition_trace"]["error"], but not top-level metadata["error"], so HTTP error bodies, exception messages with URLs, and internal hints leak through to API responses and telemetry. error_type is already in metadata, which is the part callers can safely act on — error_detail is redundant. Drop it, or replace with a safe error code.

Comment on lines +121 to +125
def _get_client(self) -> GalileoLunaClient:
"""Get or create the Galileo Luna client."""
if self._client is None:
self._client = GalileoLunaClient()
return self._client
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evaluator instances are LRU-cached and reused across calls; the base Evaluator docstring explicitly warns against storing mutable request-scoped state on self. _get_client mutates self._client with no synchronization. There is no asyncio race within _get_client itself (no await), but multi-thread or multi-event-loop reuse can construct two clients on the first concurrent call, orphan one of them, and leak its httpx.AsyncClient connection pool. Auth already validates in __init__ — construct the client eagerly there too.

Comment on lines 226 to 228
metadata = dict(result.metadata or {})
metadata["selected_data"] = data
metadata["condition_trace"] = trace
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata = dict(result.metadata or {}) then metadata["selected_data"] = data unconditionally overwrites whatever the evaluator deliberately put under selected_data. Evaluators do return that key (see the mock in engine/tests/test_core.py:160), and an evaluator might want to ship a sanitized or transformed version. The contract is not currently clear about who owns the key — consider namespacing the engine-injected value (e.g. engine_selected_data) so it cannot collide with evaluator-supplied metadata.

headers: dict[str, str] | None,
) -> tuple[str, dict[str, str]]:
request_headers = dict(headers or {})
if self.api_secret is None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The endpoint/auth mode is selected implicitly from whichever credential is present: if an API secret is set, the client uses the internal route; otherwise it uses the public API-key route. If both credentials are present, the secret path wins silently. Please make this an explicit runtime/client mode, validate that the matching credential is present, and log the selected mode without logging secrets.


text = _coerce_payload_text(data)
scorer_label = self.config.scorer_label or ""
if "output" in scorer_label:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For scalar selected data, input vs output routing is inferred by checking whether the scorer label contains output. That is surprising behavior for a public evaluator contract, especially because the example config suggests there is an explicit payload_field knob. Please make the payload side explicit in config and test it, or document this heuristic very clearly.

if isinstance(score, list):
return threshold in score
if isinstance(score, dict):
if isinstance(threshold, str) and threshold in score:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For dict scores, contains currently matches both keys and values. That can trigger unexpectedly when the threshold appears as a field name rather than as the scorer value. Please either narrow this to values-only or document and test the intended semantics explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants